Loklak - A Distributed Crawler and Data Harvester for Overcoming Rate Limits
Authors
Abstract
Modern social networks have become sources of vast quantities of data, and access to such data can be very useful to researchers and data scientists. In this paper we describe Loklak, an open-source, distributed, peer-to-peer crawler and scraper that supports such research on platforms like Twitter, Weibo and other social networks. Social networks such as Twitter and Weibo limit the rate at which a user can freely collect such data for research. Our crawler enables researchers to collect data continuously while overcoming the barriers of authentication and rate limits, in order to provide a repository of open data as a service.
Similar resources
Design and Implementation of a High-Performance Distributed Web Crawler
Broad web search engines as well as many more specialized search tools rely on web crawlers to acquire large collections of pages for indexing and analysis. Such a web crawler may interact with millions of hosts over a period of weeks or months, and thus issues of robustness, flexibility, and manageability are of major importance. In addition, I/O performance, network resources, and OS limits m...
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not a simple task to download the domain-specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed, among them a key technique is focused crawling which is able to crawl particular topical...
Inkjet 3D Printed Miniature Water Turbine Energy Harvester-Flow Meter for Distributed Measurement Systems
An energy harvester fabricated by inkjet 3D printing based on miniature water flow turbine with mechanical and electrical components necessary for electrical energy generation is shown. Turbine (outside diameter 25.4 mm (1”) and 14 mm length) is able to generate electric power of 4 mW for 14 L/min of water flow and it gives 0.7 V for optimal load resistance (RL = 55 Ω). The harvester itself may...
Using Regression based Control Limits and Probability Mixture Models for Monitoring Customer Behavior
In order to achieve the maximum flexibility in adaptation to ever changing customer’s expectations in customer relationship management, appropriate measures of customer behavior should be continually monitored. To this end, control charts adjusted for buyer’s/visitor’s prior intention to repurchase or visit again are suitable means taking into account the heterogeneity across customers. In the ...
Multi-core Processor Using Virtualization
A Web crawler is an important component of the Web search engine. It demands a large amount of hardware resources (CPU and memory) to crawl data from the rapidly growing and changing Web. The crawling process should therefore be continuous, performed from time to time to keep the crawled data up to date. This paper develops and investigates the performance of a new approach to speed up the...
Journal:
- CoRR
Volume abs/1704.03624
Publication date 2017